Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuxin Hu

GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning

Jan 08, 2026

Wenshuai Li, Xiantai Xiang, Zixiao Wen, Guangyao Zhou, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuxin Hu

Abstract:The evolution of Remote Sensing Vision-Language Models(RS-VLMs) emphasizes the importance of transitioning from perception-centric recognition toward high-level deductive reasoning to enhance cognitive reliability in complex spatial tasks. However, current models often suffer from logical hallucinations, where correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling undermines reliability in strategic spatial decision-making. To address this, we present GeoReason, a framework designed to synchronize internal thinking with final decisions. We first construct GeoReason-Bench, a logic-driven dataset containing 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. We then formulate a two-stage training strategy: (1) Supervised Knowledge Initialization to equip the model with reasoning syntax and domain expertise, and (2) Consistency-Aware Reinforcement Learning to refine deductive reliability. This second stage integrates a novel Logical Consistency Reward, which penalizes logical drift via an option permutation strategy to anchor decisions in verifiable reasoning traces. Experimental results demonstrate that our framework significantly enhances the cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.

Via

Access Paper or Ask Questions

GeoDiff-SAR: A Geometric Prior Guided Diffusion Model for SAR Image Generation

Jan 07, 2026

Fan Zhang, Xuanting Wu, Fei Ma, Qiang Yin, Yuxin Hu

Abstract:Synthetic Aperture Radar (SAR) imaging results are highly sensitive to observation geometries and the geometric parameters of targets. However, existing generative methods primarily operate within the image domain, neglecting explicit geometric information. This limitation often leads to unsatisfactory generation quality and the inability to precisely control critical parameters such as azimuth angles. To address these challenges, we propose GeoDiff-SAR, a geometric prior guided diffusion model for high-fidelity SAR image generation. Specifically, GeoDiff-SAR first efficiently simulates the geometric structures and scattering relationships inherent in real SAR imaging by calculating SAR point clouds at specific azimuths, which serves as a robust physical guidance. Secondly, to effectively fuse multi-modal information, we employ a feature fusion gating network based on Feature-wise Linear Modulation (FiLM) to dynamically regulate the weight distribution of 3D physical information, image control parameters, and textual description parameters. Thirdly, we utilize the Low-Rank Adaptation (LoRA) architecture to perform lightweight fine-tuning on the advanced Stable Diffusion 3.5 (SD3.5) model, enabling it to rapidly adapt to the distribution characteristics of the SAR domain. To validate the effectiveness of GeoDiff-SAR, extensive comparative experiments were conducted on real-world SAR datasets. The results demonstrate that data generated by GeoDiff-SAR exhibits high fidelity and effectively enhances the accuracy of downstream classification tasks. In particular, it significantly improves recognition performance across different azimuth angles, thereby underscoring the superiority of physics-guided generation.

* 22 pages, 17 figures

Via

Access Paper or Ask Questions

SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection

Jan 05, 2026

Xiantai Xiang, Guangyao Zhou, Zixiao Wen, Wenshuai Li, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuhan Liu, Zongxu Pan(+1 more)

Abstract:Multimodal object detection leveraging RGB and Infrared (IR) images is pivotal for robust perception in all-weather scenarios. While recent adapter-based approaches efficiently transfer RGB-pretrained foundation models to this task, they often prioritize model efficiency at the expense of cross-modal structural consistency. Consequently, critical structural cues are frequently lost when significant domain gaps arise, such as in high-contrast or nighttime environments. Moreover, conventional static multimodal fusion mechanisms typically lack environmental awareness, resulting in suboptimal adaptation and constrained detection performance under complex, dynamic scene variations. To address these limitations, we propose SLGNet, a parameter-efficient framework that synergizes hierarchical structural priors and language-guided modulation within a frozen Vision Transformer (ViT)-based foundation model. Specifically, we design a Structure-Aware Adapter to extract hierarchical structural representations from both modalities and dynamically inject them into the ViT to compensate for structural degradation inherent in ViT-based backbones. Furthermore, we propose a Language-Guided Modulation module that exploits VLM-driven structured captions to dynamically recalibrate visual features, thereby endowing the model with robust environmental awareness. Extensive experiments on the LLVIP, FLIR, KAIST, and DroneVehicle datasets demonstrate that SLGNet establishes new state-of-the-art performance. Notably, on the LLVIP benchmark, our method achieves an mAP of 66.1, while reducing trainable parameters by approximately 87% compared to traditional full fine-tuning. This confirms SLGNet as a robust and efficient solution for multimodal perception.

Via

Access Paper or Ask Questions

VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision

Dec 19, 2024

Yi Xu, Yuxin Hu, Zaiwei Zhang, Gregory P. Meyer, Siva Karthik Mustikovela, Siddhartha Srinivasa, Eric M. Wolff, Xin Huang

Figure 1 for VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision

Figure 2 for VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision

Figure 3 for VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision

Figure 4 for VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision

Abstract:Human drivers rely on commonsense reasoning to navigate diverse and dynamic real-world scenarios. Existing end-to-end (E2E) autonomous driving (AD) models are typically optimized to mimic driving patterns observed in data, without capturing the underlying reasoning processes. This limitation constrains their ability to handle challenging driving scenarios. To close this gap, we propose VLM-AD, a method that leverages vision-language models (VLMs) as teachers to enhance training by providing additional supervision that incorporates unstructured reasoning information and structured action labels. Such supervision enhances the model's ability to learn richer feature representations that capture the rationale behind driving patterns. Importantly, our method does not require a VLM during inference, making it practical for real-time deployment. When integrated with state-of-the-art methods, VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.

Via

Access Paper or Ask Questions

Uncertainty-Guided Enhancement on Driving Perception System via Foundation Models

Oct 02, 2024

Yunhao Yang, Yuxin Hu, Mao Ye, Zaiwei Zhang, Zhichao Lu, Yi Xu, Ufuk Topcu, Ben Snyder

Figure 1 for Uncertainty-Guided Enhancement on Driving Perception System via Foundation Models

Figure 2 for Uncertainty-Guided Enhancement on Driving Perception System via Foundation Models

Figure 3 for Uncertainty-Guided Enhancement on Driving Perception System via Foundation Models

Figure 4 for Uncertainty-Guided Enhancement on Driving Perception System via Foundation Models

Abstract:Multimodal foundation models offer promising advancements for enhancing driving perception systems, but their high computational and financial costs pose challenges. We develop a method that leverages foundation models to refine predictions from existing driving perception models -- such as enhancing object classification accuracy -- while minimizing the frequency of using these resource-intensive models. The method quantitatively characterizes uncertainties in the perception model's predictions and engages the foundation model only when these uncertainties exceed a pre-specified threshold. Specifically, it characterizes uncertainty by calibrating the perception model's confidence scores into theoretical lower bounds on the probability of correct predictions using conformal prediction. Then, it sends images to the foundation model and queries for refining the predictions only if the theoretical bound of the perception model's outcome is below the threshold. Additionally, we propose a temporal inference mechanism that enhances prediction accuracy by integrating historical predictions, leading to tighter theoretical bounds. The method demonstrates a 10 to 15 percent improvement in prediction accuracy and reduces the number of queries to the foundation model by 50 percent, based on quantitative evaluations from driving datasets.

Via

Access Paper or Ask Questions

Artificial Intelligence for Neuro MRI Acquisition: A Review

Jun 10, 2024

Hongjia Yang, Guanhua Wang, Ziyu Li, Haoxiang Li, Jialan Zheng, Yuxin Hu, Xiaozhi Cao, Congyu Liao, Huihui Ye, Qiyuan Tian

Abstract:Magnetic resonance imaging (MRI) has significantly benefited from the resurgence of artificial intelligence (AI). By leveraging AI's capabilities in large-scale optimization and pattern recognition, innovative methods are transforming the MRI acquisition workflow, including planning, sequence design, and correction of acquisition artifacts. These emerging algorithms demonstrate substantial potential in enhancing the efficiency and throughput of acquisition steps. This review discusses several pivotal AI-based methods in neuro MRI acquisition, focusing on their technological advances, impact on clinical practice, and potential risks.

* Submitted to MAGMA for review

Via

Access Paper or Ask Questions

SRDTI: Deep learning-based super-resolution for diffusion tensor MRI

Feb 17, 2021

Qiyuan Tian, Ziyu Li, Qiuyun Fan, Chanon Ngamsombat, Yuxin Hu, Congyu Liao, Fuyixue Wang, Kawin Setsompop, Jonathan R. Polimeni, Berkin Bilgic(+1 more)

Figure 1 for SRDTI: Deep learning-based super-resolution for diffusion tensor MRI

Figure 2 for SRDTI: Deep learning-based super-resolution for diffusion tensor MRI

Figure 3 for SRDTI: Deep learning-based super-resolution for diffusion tensor MRI

Figure 4 for SRDTI: Deep learning-based super-resolution for diffusion tensor MRI

Abstract:High-resolution diffusion tensor imaging (DTI) is beneficial for probing tissue microstructure in fine neuroanatomical structures, but long scan times and limited signal-to-noise ratio pose significant barriers to acquiring DTI at sub-millimeter resolution. To address this challenge, we propose a deep learning-based super-resolution method entitled "SRDTI" to synthesize high-resolution diffusion-weighted images (DWIs) from low-resolution DWIs. SRDTI employs a deep convolutional neural network (CNN), residual learning and multi-contrast imaging, and generates high-quality results with rich textural details and microstructural information, which are more similar to high-resolution ground truth than those from trilinear and cubic spline interpolation.

Via

Access Paper or Ask Questions

An Aerial Image Recognition Framework using Discrimination and Redundancy Quality Measure

Oct 06, 2014

Yuxin Hu, Luming Zhang

Figure 1 for An Aerial Image Recognition Framework using Discrimination and Redundancy Quality Measure

Figure 2 for An Aerial Image Recognition Framework using Discrimination and Redundancy Quality Measure

Figure 3 for An Aerial Image Recognition Framework using Discrimination and Redundancy Quality Measure

Figure 4 for An Aerial Image Recognition Framework using Discrimination and Redundancy Quality Measure

Abstract:Aerial image categorization plays an indispensable role in remote sensing and artificial intelligence. In this paper, we propose a new aerial image categorization framework, focusing on organizing the local patches of each aerial image into multiple discriminative subgraphs. The subgraphs reflect both the geometric property and the color distribution of an aerial image. First, each aerial image is decomposed into a collection of regions in terms of their color intensities. Thereby region connected graph (RCG), which models the connection between the spatial neighboring regions, is constructed to encode the spatial context of an aerial image. Second, a subgraph mining technique is adopted to discover the frequent structures in the RCGs constructed from the training aerial images. Thereafter, a set of refined structures are selected among the frequent ones toward being highly discriminative and low redundant. Lastly, given a new aerial image, its sub-RCGs corresponding to the refined structures are extracted. They are further quantized into a discriminative vector for SVM classification. Thorough experimental results validate the effectiveness of the proposed method. In addition, the visualized mined subgraphs show that the discriminative topologies of each aerial image are discovered.

Via

Access Paper or Ask Questions